Evolutionary Conservation
   HOME

TheInfoList



OR:

In
evolutionary biology Evolutionary biology is the subfield of biology that studies the evolutionary processes (natural selection, common descent, speciation) that produced the diversity of life on Earth. It is also defined as the study of the history of life fo ...
, conserved sequences are identical or similar
sequences In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is called t ...
in
nucleic acid Nucleic acids are biopolymers, macromolecules, essential to all known forms of life. They are composed of nucleotides, which are the monomers made of three components: a 5-carbon sugar, a phosphate group and a nitrogenous base. The two main cl ...
s ( DNA and
RNA Ribonucleic acid (RNA) is a polymeric molecule essential in various biological roles in coding, decoding, regulation and expression of genes. RNA and deoxyribonucleic acid ( DNA) are nucleic acids. Along with lipids, proteins, and carbohydra ...
) or
proteins Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...
across species ( orthologous sequences), or within a
genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...
( paralogous sequences), or between donor and receptor taxa ( xenologous sequences). Conservation indicates that a sequence has been maintained by
natural selection Natural selection is the differential survival and reproduction of individuals due to differences in phenotype. It is a key mechanism of evolution, the change in the heritable traits characteristic of a population over generations. Charle ...
. A highly conserved sequence is one that has remained relatively unchanged far back up the
phylogenetic tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological spec ...
, and hence far back in
geological time The geologic time scale, or geological time scale, (GTS) is a representation of time based on the rock record of Earth. It is a system of chronological dating that uses chronostratigraphy (the process of relating strata to time) and geochron ...
. Examples of highly conserved sequences include the RNA components of
ribosome Ribosomes ( ) are macromolecular machines, found within all cells, that perform biological protein synthesis (mRNA translation). Ribosomes link amino acids together in the order specified by the codons of messenger RNA (mRNA) molecules to ...
s present in all
domain Domain may refer to: Mathematics *Domain of a function, the set of input values for which the (total) function is defined **Domain of definition of a partial function **Natural domain of a partial function **Domain of holomorphy of a function * Do ...
s of life, the
homeobox A homeobox is a DNA sequence, around 180 base pairs long, that regulates large-scale anatomical features in the early stages of embryonic development. For instance, mutations in a homeobox may change large-scale anatomical features of the full- ...
sequences widespread amongst
Eukaryotes Eukaryotes () are organisms whose cells have a nucleus. All animals, plants, fungi, and many unicellular organisms, are Eukaryotes. They belong to the group of organisms Eukaryota or Eukarya, which is one of the three domains of life. Bacte ...
, and the
tmRNA Transfer-messenger RNA (abbreviated tmRNA, also known as 10Sa RNA and by its genetic name SsrA) is a bacterial RNA molecule with dual tRNA-like and messenger RNA-like properties. The tmRNA forms a ribonucleoprotein complex (tmRNP) together with ...
in
Bacteria Bacteria (; singular: bacterium) are ubiquitous, mostly free-living organisms often consisting of one biological cell. They constitute a large domain of prokaryotic microorganisms. Typically a few micrometres in length, bacteria were among ...
. The study of sequence conservation overlaps with the fields of
genomics Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dim ...
,
proteomics Proteomics is the large-scale study of proteins. Proteins are vital parts of living organisms, with many functions such as the formation of structural fibers of muscle tissue, enzymatic digestion of food, or synthesis and replication of DNA. In ...
,
evolutionary biology Evolutionary biology is the subfield of biology that studies the evolutionary processes (natural selection, common descent, speciation) that produced the diversity of life on Earth. It is also defined as the study of the history of life fo ...
,
phylogenetics In biology, phylogenetics (; from Greek language, Greek wikt:φυλή, φυλή/wikt:φῦλον, φῦλον [] "tribe, clan, race", and wikt:γενετικός, γενετικός [] "origin, source, birth") is the study of the evolutionary his ...
,
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
and mathematics.


History

The discovery of the role of DNA in
heredity Heredity, also called inheritance or biological inheritance, is the passing on of traits from parents to their offspring; either through asexual reproduction or sexual reproduction, the offspring cells or organisms acquire the genetic inform ...
, and observations by
Frederick Sanger Frederick Sanger (; 13 August 1918 – 19 November 2013) was an English biochemist who received the Nobel Prize in Chemistry twice. He won the 1958 Chemistry Prize for determining the amino acid sequence of insulin and numerous other p ...
of variation between animal
insulin Insulin (, from Latin ''insula'', 'island') is a peptide hormone produced by beta cells of the pancreatic islets encoded in humans by the ''INS'' gene. It is considered to be the main anabolic hormone of the body. It regulates the metabolism o ...
s in 1949, prompted early molecular biologists to study
taxonomy Taxonomy is the practice and science of categorization or classification. A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types. ...
from a molecular perspective. Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known
orthologous Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a s ...
proteins, such as
hemoglobin Hemoglobin (haemoglobin BrE) (from the Greek word αἷμα, ''haîma'' 'blood' + Latin ''globus'' 'ball, sphere' + ''-in'') (), abbreviated Hb or Hgb, is the iron-containing oxygen-transport metalloprotein present in red blood cells (erythrocyte ...
and
cytochrome c The cytochrome complex, or cyt ''c'', is a small hemeprotein found loosely associated with the inner membrane of the mitochondrion. It belongs to the cytochrome c family of proteins and plays a major role in cell apoptosis. Cytochrome c is hig ...
. In 1965, Émile Zuckerkandl and
Linus Pauling Linus Carl Pauling (; February 28, 1901August 19, 1994) was an American chemist, biochemist, chemical engineer, peace activist, author, and educator. He published more than 1,200 papers and books, of which about 850 dealt with scientific top ...
introduced the concept of the
molecular clock The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleoti ...
, proposing that steady rates of amino acid replacement could be used to estimate the time since two organisms diverged. While initial phylogenies closely matched the
fossil record A fossil (from Classical Latin , ) is any preserved remains, impression, or trace of any once-living thing from a past geological age. Examples include bones, shells, exoskeletons, stone imprints of animals or microbes, objects preserved in ...
, observations that some genes appeared to evolve at different rates led to the development of theories of
molecular evolution Molecular evolution is the process of change in the sequence composition of cellular molecules such as DNA, RNA, and proteins across generations. The field of molecular evolution uses principles of evolutionary biology and population genetics ...
. Margaret Dayhoff's 1966 comparison of
ferrodoxin Ferredoxins (from Latin ''ferrum'': iron + redox, often abbreviated "fd") are iron–sulfur proteins that mediate electron transfer in a range of metabolic reactions. The term "ferredoxin" was coined by D.C. Wharton of the DuPont Co. and applied to ...
sequences showed that
natural selection Natural selection is the differential survival and reproduction of individuals due to differences in phenotype. It is a key mechanism of evolution, the change in the heritable traits characteristic of a population over generations. Charle ...
would act to conserve and optimise protein sequences essential to life.


Mechanisms

Over many generations, nucleic acid sequences in the
genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...
of an evolutionary lineage can gradually change over time due to random mutations and deletions. Sequences may also recombine or be deleted due to
chromosomal rearrangement In genetics, a chromosomal rearrangement is a mutation that is a type of chromosome abnormality involving a change in the structure of the native chromosome. Such changes may involve several different classes of events, like deletions, duplicatio ...
s. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate. Conservation can occur in coding and
non-coding Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, microRNA, piRNA, ribosomal RNA, and regula ...
nucleic acid sequences. Highly conserved DNA sequences are thought to have functional value, although the role for many highly conserved non-coding DNA sequences is poorly understood. The extent to which a sequence is conserved can be affected by varying
selection pressures Any cause that reduces or increases reproductive success in a portion of a population potentially exerts evolutionary pressure, selective pressure or selection pressure, driving natural selection. It is a quantitative description of the amount of ...
, its
robustness Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
to mutation,
population size In population genetics and population ecology, population size (usually denoted ''N'') is the number of individual organisms in a population. Population size is directly associated with amount of genetic drift, and is the underlying cause of effect ...
and
genetic drift Genetic drift, also known as allelic drift or the Wright effect, is the change in the frequency of an existing gene variant (allele) in a population due to random chance. Genetic drift may cause gene variants to disappear completely and there ...
. Many functional sequences are also
modular Broadly speaking, modularity is the degree to which a system's components may be separated and recombined, often with the benefit of flexibility and variety in use. The concept of modularity is used primarily to reduce complexity by breaking a sy ...
, containing regions which may be subject to independent
selection pressures Any cause that reduces or increases reproductive success in a portion of a population potentially exerts evolutionary pressure, selective pressure or selection pressure, driving natural selection. It is a quantitative description of the amount of ...
, such as
protein domains In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of s ...
.


Coding sequence

In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of the
genetic code The genetic code is the set of rules used by living cells to translate information encoded within genetic material ( DNA or RNA sequences of nucleotide triplets, or codons) into proteins. Translation is accomplished by the ribosome, which links ...
means that
synonymous mutations A synonymous substitution (often called a ''silent'' substitution though they are not always silent) is the evolutionary substitution of one base for another in an exon of a gene coding for a protein, such that the produced amino acid sequence ...
in a coding sequence do not affect the amino acid sequence of its protein product. Amino acid sequences can be conserved to maintain the
structure A structure is an arrangement and organization of interrelated elements in a material object or system, or the object or system so organized. Material structures include man-made objects such as buildings and machines and natural objects such as ...
or function of a protein or domain. Conserved proteins undergo fewer
amino acid replacement Amino acid replacement is a change from one amino acid to a different amino acid in a protein due to point mutation in the corresponding DNA sequence. It is caused by nonsynonymous missense mutation which changes the codon sequence to code other a ...
s, or are more likely to substitute amino acids with similar biochemical properties. Within a sequence, amino acids that are important for
folding Fold, folding or foldable may refer to: Arts, entertainment, and media * ''Fold'' (album), the debut release by Australian rock band Epicure * Fold (poker), in the game of poker, to discard one's hand and forfeit interest in the current pot *Abov ...
, structural stability, or that form a
binding site In biochemistry and molecular biology, a binding site is a region on a macromolecule such as a protein that binds to another molecule with specificity. The binding partner of the macromolecule is often referred to as a ligand. Ligands may inclu ...
may be more highly conserved. The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. The
codon usage bias Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides (a triplet) that encodes a specific amino acid residue in a polypeptide chain or for the terminatio ...
in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that cause
secondary structure Protein secondary structure is the three dimensional conformational isomerism, form of ''local segments'' of proteins. The two most common Protein structure#Secondary structure, secondary structural elements are alpha helix, alpha helices and beta ...
in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA.


Non-coding

Non-coding sequences important for
gene regulation Regulation of gene expression, or gene regulation, includes a wide range of mechanisms that are used by cells to increase or decrease the production of specific gene products (protein or RNA). Sophisticated programs of gene expression are wide ...
, such as the binding or recognition sites of
ribosomes Ribosomes ( ) are macromolecular machines, found within all cells, that perform biological protein synthesis (mRNA translation). Ribosomes link amino acids together in the order specified by the codons of messenger RNA (mRNA) molecules to f ...
and
transcription factor In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. The fu ...
s, may be conserved within a genome. For example, the promoter of a conserved gene or
operon In genetics, an operon is a functioning unit of DNA containing a cluster of genes under the control of a single promoter. The genes are transcribed together into an mRNA strand and either translated together in the cytoplasm, or undergo splic ...
may also be conserved. As with proteins, nucleic acids that are important for the structure and function of
non-coding RNA A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein. The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene. Abundant and functionally important types of non-c ...
(ncRNA) can also be conserved. However, sequence conservation in ncRNAs is generally poor compared to protein-coding sequences, and
base pairs A base pair (bp) is a fundamental unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds. They form the building blocks of the DNA double helix and contribute to the folded structure of both DNA ...
that contribute to structure or function are often conserved instead.


Identification

Conserved sequences are typically identified by
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
approaches based on
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alig ...
. Advances in high-throughput DNA sequencing and
protein mass spectrometry Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. Mass spectrometry is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instrume ...
has substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.


Homology search

Conserved sequences may be identified by
homology Homology may refer to: Sciences Biology *Homology (biology), any characteristic of biological organisms that is derived from a common ancestor * Sequence homology, biological homology between DNA, RNA, or protein sequences *Homologous chrom ...
search, using tools such as
BLAST Blast or The Blast may refer to: * Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film) ...
,
HMMER HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing ...

OrthologR
and Infernal. Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from
multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...
s of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information, can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and
BLOSUM In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based o ...
. Highly scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.


Multiple sequence alignment

Multiple sequence alignments can be used to visualise conserved sequences. The
CLUSTAL Clustal is a series of widely used computer programs used in bioinformatics for multiple sequence alignment. There have been many versions of Clustal over the development of the algorithm that are listed below. The analysis of each tool and its a ...
format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ) Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height.


Genome alignment

Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and
scalability Scalability is the property of a system to handle a growing amount of work by adding resources to the system. In an economic context, a scalable business model implies that a company can increase sales given increased resources. For example, a ...
of WGA tools remains limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes. However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible.


Scoring systems

Other approaches use measurements of conservation based on
statistical tests A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. ...
that attempt to identify sequences which mutate differently to an expected background (neutral) mutation rate. The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence. LIST (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to human are more significant when assessing conservation compared to those in distantly related species. Thus, LIST utilizes the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA) and then it estimates conservation based on the taxonomy distances of these sequences to human. Unlike other tools, LIST ignores the count/frequency of variations in the MSA. Aminode combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject to
purifying selection In natural selection, negative selection or purifying selection is the selective removal of alleles that are deleterious. This can result in stabilising selection through the purging of deleterious genetic polymorphisms that arise through random ...
and are typically critical for normal protein function. Other approaches such as PhyloP and PhyloHMM incorporate statistical phylogenetics methods to compare
probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon i ...
s of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated of the number of substitutions expected to occur for a column in a multiple sequence alignment, based on a
phylogenetic tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological spec ...
. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than distantly related ones, and therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as a
likelihood-ratio test In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after im ...
or
score test In statistics, the score test assesses constraints on statistical parameters based on the gradient of the likelihood function—known as the ''score''—evaluated at the hypothesized parameter value under the null hypothesis. Intuitively, if the ...
.
P-value In null-hypothesis significance testing, the ''p''-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small ''p''-value means ...
s generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM uses
hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...
s to generate probability distributions. The PhyloP software package compares probability distributions using a
likelihood-ratio test In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after im ...
or
score test In statistics, the score test assesses constraints on statistical parameters based on the gradient of the likelihood function—known as the ''score''—evaluated at the hypothesized parameter value under the null hypothesis. Intuitively, if the ...
, as well as using a GERP-like scoring system.


Extreme conservation


Ultra-conserved elements

Ultra-conserved element An ultra-conserved element (UCE) was originally defined as a genome segment longer than 200 base pairs (bp) that is absolutely conserved, with no insertions or deletions and 100% identity, between orthologous regions of the human, rat, and mouse ge ...
s or UCEs are sequences that are highly similar or identical across multiple taxonomic groupings. These were first discovered in
vertebrates Vertebrates () comprise all animal taxa within the subphylum Vertebrata () ( chordates with backbones), including all mammals, birds, reptiles, amphibians, and fish. Vertebrates represent the overwhelming majority of the phylum Chordata, ...
, and have subsequently been identified within widely-differing taxa. While the origin and function of UCEs are poorly understood, they have been used to investigate deep-time divergences in
amniote Amniotes are a clade of tetrapod vertebrates that comprises sauropsids (including all reptiles and birds, and extinct parareptiles and non-avian dinosaurs) and synapsids (including pelycosaurs and therapsids such as mammals). They are disti ...
s,
insects Insects (from Latin ') are pancrustacean hexapod invertebrates of the class Insecta. They are the largest group within the arthropod phylum. Insects have a chitinous exoskeleton, a three-part body (head, thorax and abdomen), three pairs of j ...
, and between
animals Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia. With few exceptions, animals consume organic material, breathe oxygen, are able to move, can reproduce sexually, and go through an ontogenetic stage in ...
and
plants Plants are predominantly Photosynthesis, photosynthetic eukaryotes of the Kingdom (biology), kingdom Plantae. Historically, the plant kingdom encompassed all living things that were not animals, and included algae and fungi; however, all curr ...
.


Universally conserved genes

The most highly conserved genes are those that can be found in all organisms. These consist mainly of the
ncRNA A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein. The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene. Abundant and functionally important types of non-c ...
s and proteins required for
transcription Transcription refers to the process of converting sounds (voice, music etc.) into letters or musical notes, or producing a copy of something in another medium, including: Genetics * Transcription (biology), the copying of DNA into RNA, the fir ...
and
translation Translation is the communication of the Meaning (linguistic), meaning of a #Source and target languages, source-language text by means of an Dynamic and formal equivalence, equivalent #Source and target languages, target-language text. The ...
, which are assumed to have been conserved from the
last universal common ancestor The last universal common ancestor (LUCA) is the most recent population from which all organisms now living on Earth share common descent—the most recent common ancestor of all current life on Earth. This includes all cellular organisms; t ...
of all life. Genes or gene families that have been found to be universally conserved include GTP-binding elongation factors, Methionine aminopeptidase 2,
Serine hydroxymethyltransferase Serine hydroxymethyltransferase (SHMT) is a pyridoxal phosphate (PLP) (Vitamin B6) dependent enzyme () which plays an important role in cellular one-carbon pathways by catalyzing the reversible, simultaneous conversions of L-serine to glycine ...
, and ATP transporters. Components of the transcription machinery, such as
RNA polymerase In molecular biology, RNA polymerase (abbreviated RNAP or RNApol), or more specifically DNA-directed/dependent RNA polymerase (DdRP), is an enzyme that synthesizes RNA from a DNA template. Using the enzyme helicase, RNAP locally opens the ...
and
helicase Helicases are a class of enzymes thought to be vital to all organisms. Their main function is to unpack an organism's genetic material. Helicases are motor proteins that move directionally along a nucleic acid phosphodiester backbone, separatin ...
s, and of the translation machinery, such as
ribosomal RNA Ribosomal ribonucleic acid (rRNA) is a type of non-coding RNA which is the primary component of ribosomes, essential to all cells. rRNA is a ribozyme which carries out protein synthesis in ribosomes. Ribosomal RNA is transcribed from ribosomal ...
s,
tRNAs Transfer RNA (abbreviated tRNA and formerly referred to as sRNA, for soluble RNA) is an adaptor molecule composed of RNA, typically 76 to 90 nucleotides in length (in eukaryotes), that serves as the physical link between the mRNA and the amino a ...
and
ribosomal protein A ribosomal protein (r-protein or rProtein) is any of the proteins that, in conjunction with rRNA, make up the ribosomal subunits involved in the cellular process of translation. ''E. coli'', other bacteria and Archaea have a 30S small subunit an ...
s are also universally conserved.


Applications


Phylogenetics and taxonomy

Sets of conserved sequences are often used for generating
phylogenetic tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological spec ...
s, as it can be assumed that organisms with similar sequences are closely related. The choice of sequences may vary depending on the taxonomic scope of the study. For example, the most highly conserved genes such as the 16S RNA and other ribosomal sequences are useful for reconstructing deep phylogenetic relationships and identifying bacterial phyla in
metagenomics Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microb ...
studies. Sequences that are conserved within a
clade A clade (), also known as a monophyletic group or natural group, is a group of organisms that are monophyletic – that is, composed of a common ancestor and all its lineal descendants – on a phylogenetic tree. Rather than the English term, ...
but undergo some mutations, such as
housekeeping gene In molecular biology, housekeeping genes are typically constitutive genes that are required for the maintenance of basic cellular function, and are expressed in all cells of an organism under normal and patho-physiological conditions. Although ...
s, can be used to study species relationships. The
internal transcribed spacer Internal transcribed spacer (ITS) is the spacer DNA situated between the small-subunit ribosomal RNA (rRNA) and large-subunit rRNA genes in the chromosome or the corresponding transcribed region in the polycistronic rRNA precursor transcript. I ...
(ITS) region, which is required for spacing conserved rRNA genes but undergoes rapid evolution, is commonly used to classify
fungi A fungus ( : fungi or funguses) is any member of the group of eukaryotic organisms that includes microorganisms such as yeasts and molds, as well as the more familiar mushrooms. These organisms are classified as a kingdom, separately from ...
and strains of rapidly evolving bacteria.


Medical research

As highly conserved sequences often have important biological functions, they can be useful a starting point for identifying the cause of
genetic disease A genetic disorder is a health problem caused by one or more abnormalities in the genome. It can be caused by a mutation in a single gene (monogenic) or multiple genes (polygenic) or by a chromosomal abnormality. Although polygenic disorders ...
s. Many congenital metabolic disorders and
Lysosomal storage disease Lysosomal storage diseases (LSDs; ) are a group of over 70 rare inherited metabolic disorders that result from defects in lysosomal function. Lysosomes are sacs of enzymes within cells that digest large molecules and pass the fragments on to other ...
s are the result of changes to individual conserved genes, resulting in missing or faulty enzymes that are the underlying cause of the symptoms of the disease. Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such as
mice A mouse ( : mice) is a small rodent. Characteristically, mice are known to have a pointed snout, small rounded ears, a body-length scaly tail, and a high breeding rate. The best known mouse species is the common house mouse (''Mus musculus' ...
or
fruit flies Fruit fly may refer to: Organisms * Drosophilidae, a family of small flies, including: ** ''Drosophila'', the genus of small fruit flies and vinegar flies ** ''Drosophila melanogaster'' or common fruit fly ** ''Drosophila suzukii'' or Asian fruit ...
, and studying the effects of
knock-outs An electrical enclosure is a cabinet for electrical or electronic equipment to mount switches, knobs and displays and to prevent electrical shock to equipment users and protect the contents from the environment. The enclosure is the only part ...
of these genes.
Genome-wide association studies In genomics, a genome-wide association study (GWA study, or GWAS), also known as whole genome association study (WGA study, or WGAS), is an observational study of a genome-wide set of genetic variants in different individuals to see if any varia ...
can also be used to identify variation in conserved sequences associated with disease or health outcomes. In Alzehimer's disease there had been over two dozen novel potential susceptibility loci discovered


Functional annotation

Identifying conserved sequences can be used to discover and predict functional sequences such as genes. Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such as
Pfam Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 35.0, was released in November 2021 and contains 19,632 families. Uses ...
and the
Conserved Domain Database The Conserved Domain Database (CDD) is a database of well-annotated multiple sequence alignment models and derived database search models, for ancient domains and full-length proteins. Philosophy Domains can be thought of as distinct functional ...
can be used to annotate functional domains in predicted protein coding genes.


See also

*
Evolutionary developmental biology Evolutionary developmental biology (informally, evo-devo) is a field of biological research that compares the developmental processes of different organisms to infer how developmental processes evolved. The field grew from 19th-century beginni ...
*
NAPP (database) The Nucleic acid phylogenetic profiling (NAPP) is a database of coding and non-coding sequences according to their pattern of conservation across the other genomes. See also * Conserved sequence In evolutionary biology, conserved sequences are ...
* Segregating site *
Sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alig ...
*
Sequence alignment software This list of sequence alignment software is a compilation of software tools and web portals used in pairwise sequence alignment and multiple sequence alignment. See structural alignment software for structural alignment of proteins. Database searc ...
* UCbase *
Ultra-conserved element An ultra-conserved element (UCE) was originally defined as a genome segment longer than 200 base pairs (bp) that is absolutely conserved, with no insertions or deletions and 100% identity, between orthologous regions of the human, rat, and mouse ge ...


References

{{Use dmy dates, date=April 2017 Computational phylogenetics Nucleic acids Protein structure Population genetics Molecular genetics Evolutionary developmental biology